Clustering Research across Tibetan and Chinese Texts
Abstract
Tibetan text clustering has potential in the Tibetan information processing domain. In this paper, clustering across Chinese and Tibetan texts is proposed to benefit Chinese-Tibetan machine translation and sentence alignment. A Tibetan-Chinese keyword table is the main means of implementing text clustering across the two languages. An improved K-means algorithm and an improved density-based spatial clustering of applications with noise (DBSCAN) algorithm are proposed. Experiments show that the improved K-means algorithm yields stable clustering results and outperforms traditional K-means once the limitation of randomly selecting the initial k data points is eliminated. The improved DBSCAN algorithm performs well given reasonable parameter settings, and it outperforms the improved K-means. The study is useful for constructing a parallel corpus of Chinese and Tibetan texts.
Subject Categories and Descriptors: I.5.3 [Clustering]: Algorithm; I.2.7 [Natural Language Processing]: Text Analysis
General Terms: Algorithm, Performance
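The abstract does not say how the keyword table is applied or how the initial centers are chosen, so the following is only a minimal sketch under stated assumptions. It maps Tibetan tokens to Chinese keywords through a small hypothetical table, builds count vectors over the shared Chinese vocabulary, and replaces random K-means initialization with farthest-first selection (a common deterministic alternative, swapped in here for illustration and not necessarily the authors' improvement). The table entries, function names, and sample documents are all illustrative.

```python
from math import dist

# Hypothetical Tibetan -> Chinese keyword table (illustrative entries only).
KEYWORD_TABLE = {"བོད": "西藏", "སློབ་གྲྭ": "学校", "སྨན": "药"}
VOCAB = sorted(set(KEYWORD_TABLE.values()))


def vectorize(tokens):
    """Map Tibetan tokens to Chinese keywords, then count over VOCAB."""
    mapped = [KEYWORD_TABLE.get(t, t) for t in tokens]
    return [float(mapped.count(w)) for w in VOCAB]


def farthest_first_centers(points, k):
    """Deterministic initialization: start from the first point, then
    repeatedly add the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, iterations=20):
    """Plain Lloyd iterations with the deterministic initialization above."""
    centers = farthest_first_centers(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest center.
        labels = [
            min(range(k), key=lambda j: dist(p, centers[j])) for p in points
        ]
        # Move each center to the mean of its members (skip empty clusters).
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Pre-tokenized toy documents: two Chinese, two Tibetan.
docs = [
    ["西藏", "西藏", "学校"],
    ["བོད", "སློབ་གྲྭ"],
    ["药", "药"],
    ["སྨན"],
]
points = [vectorize(d) for d in docs]
labels = kmeans(points, 2)
print(labels)
```

Because the initial centers are chosen deterministically, repeated runs on the same input return the same labels, which is the kind of stability the abstract attributes to removing random initialization.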
Similar resources
Tibetan Text Clustering Based on Machine Learning
Tibetan information processing technology has made some progress, but it lags behind Chinese and English information processing and still needs more attention. Text clustering has the potential to accelerate the development of Tibetan information processing. In this paper, we propose an approach to Tibetan text clustering based on machine learning. Firstly, the approach...
Tibetan-Chinese Cross Language Text Similarity Calculation Based on LDA Topic Model
Topic model building is the basis and the most critical module of cross-language topic detection and tracking. Topic models can also be applied to cross-language text similarity calculation, where they improve the efficiency and speed of calculation by reducing the texts' dimensionality. In this paper, we use the LDA model in cross-language text similarity computation to obtain Tibetan-Chinese c...
Chinese-Tibetan bilingual clustering based on random walk
In recent years, multi-source clustering has received a significant amount of attention. Several multi-source clustering methods have been developed from different perspectives. In this paper, aiming at addressing the problem of Chinese–Tibetan bilingual document clustering, a novel bilingual clustering scheme is proposed, which can well capture both the intralingua document structures and inte...
Study on Hot Topic Discovery from Chinese Texts
With the development of information technology, there has been an increased popularity in the use of electronic texts. Topic detection and tracking can identify hot information from isolated texts. Obtaining hot topics has become an important issue in recent years. The combination of statistics and natural language processing was utilized in the current study to discover hot topics from texts. ...
Chinese Name Disambiguation Based on Adaptive Clustering with the Attribute Features
For the CLP2012 evaluation task on named entity recognition and disambiguation in Chinese, a Chinese name disambiguation method based on adaptive clustering with attribute features is proposed. First, 12-dimensional character attribute features are defined, and a tagged attribute feature corpus is used to train a recognition model of attribute features by Conditional Rando...